Conversation

@my-vegetable-has-exploded
Contributor

@my-vegetable-has-exploded my-vegetable-has-exploded commented Dec 30, 2025

Log persistence can fail if the collector container crashes or restarts. To ensure reliability, the collector must resume processing any logs left in the prev-logs directory upon recovery.

In this PR, we enhanced the WatchPrevLogsLoops() function to perform an initial scan of the prev-logs directory at startup before entering the watch loop. This ensures that any session or node folders left from previous runs (e.g., after a container restart) are correctly processed and uploaded.

Why are these changes needed?

Related issue number

Closes #4281

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: my-vegetable-has-exploded <[email protected]>
@my-vegetable-has-exploded my-vegetable-has-exploded marked this pull request as ready for review December 30, 2025 11:30
}
}

// isFileAlreadyPersisted checks if a file has already been persisted to persist-complete-logs
Contributor Author


isFileAlreadyPersisted detects files already moved to persist-complete-logs and avoids duplicate uploads.

Member

@Future-Outlier Future-Outlier left a comment


Hi, thank you for this PR!
Do you mind providing a reproduction script like this so I can test it in my env more easily?

https://github.com/ray-project/kuberay/blob/master/historyserver/docs/set_up_collector.md#test-the-collector-on-the-kind-cluster

@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Jan 1, 2026

Hi @my-vegetable-has-exploded, thanks for your contribution.

Would you mind modifying the PR title and description to clarify that this feature is actually for container restart, not pod restart? I believe this works only for cases in which the collector sidecar fails and recovers. For now, the data is permanently erased if the head pod restarts since we use an emptyDir volume.

If I have misunderstood anything, please correct me. Thanks!

Comment on lines 52 to +53
r.prevLogsDir = "/tmp/ray/prev-logs"
r.persistCompleteLogsDir = "/tmp/ray/persist-complete-logs"
Contributor


Could we move these to constants in historyserver/pkg/utils/utils.go, similar to KunWuLuan@9b2ca52?

Contributor


Hi @justinyeh1995,

Thanks for pointing this out. I’m currently working on it (handling all constants at once) and will open a PR soon. I’m not sure whether it should be included here or handled in a separate PR.

Contributor

@justinyeh1995 justinyeh1995 Jan 1, 2026


No problem. I wasn't aware of that when I wrote the comment. Is the PR related to any issue, though? I think I need to catch up quite a bit.

Contributor


You can refer to this issue. Thanks a lot!

Contributor


Thanks for the heads up!

@my-vegetable-has-exploded my-vegetable-has-exploded changed the title feat(historyserver):re-push prev-logs on pod restart feat(history server):re-push prev-logs on container restart Jan 4, 2026
Member

@Future-Outlier Future-Outlier left a comment


plz also add e2e test in this PR.

@my-vegetable-has-exploded
Contributor Author

plz also add e2e test in this PR.

Got it! I'll add it tonight.

Member

@Future-Outlier Future-Outlier left a comment


cc @lorriexingfang to review


nit: consider using filepath functions to handle file paths.

Contributor Author


Thanks for the review. There are similar issues elsewhere in this file. Maybe we can file a new ticket to handle them?

Member


Agreed, let's handle it as a follow-up.

Member


Hi, @my-vegetable-has-exploded,
do you mind creating an issue so that I can assign it to others?

Contributor Author


no problem.


Member

@Future-Outlier Future-Outlier left a comment


cursor review

@Future-Outlier
Member

bugbot run

@Future-Outlier
Member

cursor review

Signed-off-by: my-vegetable-has-exploded <[email protected]>
@Future-Outlier
Member

cc @JiangJiaWei1103 @CheyuWu @seanlaii to test this PR locally before you approve this, thank you!

Comment on lines 505 to 507
// 2. Inject "leftover" logs into prev-logs via the ray-head container while collector is down.
// Note: We only inject logs, not node_events, because the collector's prev-logs processing
// currently only handles the logs directory. node_events are handled by the EventServer separately.
Contributor


Just curious. How can we ensure that the injection is completed during the collector downtime?

Contributor Author

@my-vegetable-has-exploded my-vegetable-has-exploded Jan 11, 2026


Because Kubernetes restarts the collector immediately after kill 1, I can't absolutely guarantee the process stays down until the injection finishes. But the injection is short-lived, so the current approach works well in my local environment.
If you have a better way to handle this, I’d be glad to implement it.

Contributor


I agree that the injection completes quickly while the collector takes some time to become ready, and this approach works in my local environment as well. At the moment, I don’t have a more robust solution in mind.

cc @seanlaii @CheyuWu Do you have any suggestions? Thanks so much!

@CheyuWu CheyuWu self-requested a review January 10, 2026 01:49
return
}

// Check if this directory has already been processed by checking in persist-complete-logs
Contributor

@JiangJiaWei1103 JiangJiaWei1103 Jan 10, 2026


No changes are needed here. This note is just to clarify that we still need to check for leftover log files, since the presence of log directories under the persist-complete-logs directory alone does not guarantee that all logs have been fully persisted.

Contributor Author


Thanks for the review! If the collector crashes halfway through a node directory, the persist-complete-logs entry already exists but some logs remain in prev-logs. Keeping this check would cause the collector to skip the entire directory upon restart, leading to data loss. The current file-level check isFileAlreadyPersisted handles this "partial success" scenario correctly without duplicate uploads.

Contributor Author

@my-vegetable-has-exploded my-vegetable-has-exploded Jan 10, 2026


If you replace it with the following code, you can trigger this case using `cd historyserver && go test -v ./pkg/collector/logcollector/runtime/logcollector/ -run TestScanAndProcess`. @JiangJiaWei1103

```go
	completeDir := filepath.Join(r.persistCompleteLogsDir, sessionID, nodeID, "logs")
	if _, err := os.Stat(completeDir); err == nil {
		logrus.Infof("Session %s node %s logs already processed, skipping", sessionID, nodeID)
		return
	}
```

And the e2e test also covers this case.

Contributor


Wow, thanks for the detailed explanation! I originally just wanted to leave a brief note for maintainers to get why we removed the code snippet here. All tests pass in my local env. Thanks!

}

// Walk through the logs directory and process all files
err := filepath.WalkDir(logsDir, func(path string, info fs.DirEntry, err error) error {
Contributor Author


When the collector restarts, WatchPrevLogsLoops calls processPrevLogsDir, so we can process leftover log files in prev-logs/{sessionID}/{nodeID}/logs/ directories.

Co-authored-by: 江家瑋 <[email protected]>
Signed-off-by: yi wang <[email protected]>
Comment on lines +144 to +166

## Troubleshooting

### "too many open files" error

If you encounter `level=fatal msg="Create fsnotify NewWatcher error too many open files"` in the collector logs,
it is likely due to the inotify limits on the Kubernetes nodes.

To fix this, increase the limits on the **host nodes** (not inside the container):

```bash
# Apply changes immediately
sudo sysctl -w fs.inotify.max_user_instances=8192
sudo sysctl -w fs.inotify.max_user_watches=524288
```

To make these changes persistent across reboots, use the following lines:

```bash
echo "fs.inotify.max_user_instances=8192" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_watches=524288" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
Member


very cool, never heard this, I learned something

Member


Hi, @my-vegetable-has-exploded,
do you mind creating an issue so that I can assign it to others?

Member

@Future-Outlier Future-Outlier left a comment


LGTM, just revise it, thank you!
cc @rueian to merge

@Future-Outlier Future-Outlier moved this from collector to can be merged in @Future-Outlier's kuberay project Jan 14, 2026
@Future-Outlier Future-Outlier moved this from can be merged to Done in @Future-Outlier's kuberay project Jan 14, 2026
@Future-Outlier Future-Outlier moved this from Done to can be merged in @Future-Outlier's kuberay project Jan 14, 2026
@Future-Outlier Future-Outlier changed the title feat(history server):re-push prev-logs on container restart [historyserver][collector] Add file-level idempotency check for prev-logs processing on container restart Jan 14, 2026


Development

Successfully merging this pull request may close these issues.

[history server][collector] Implement strategy for handling duplicate logs on pod restart

6 participants